Outline

•  Virtual  Reality  Call  for  Fire  Training 

•  The  Radiobot-CFF  System 

•  Evaluation  method 

•  Evaluation  Results 

•  Next  Steps 


Radiobots: Project history


•  2004:  Piloted  within  ICT  Mission  Rehearsal  Exercise  (MRE)  Project 

•  Simple  dialogue  systems  for  radio  characters 

•  Output  through  radio 

•  2004-2005:  seedling  effort 


•  Further  development  of  MRE  radiobots 

•  Analysis  of  radiobot  domains  &  tools 

•  Focus  on  call  for  fire 

•  Tools  for  data  collection  &  semi-automatic  operation 

•  Initial  data  collection  at  Ft  Sill  and  analysis 

• 2005-2006: Radiobots for JFETS: Radiobot-CFF

Radiobots for JFETS: Team members


• USC ICT (Dr. David Traum, Antonio Roque, Susan Robinson, Dr. Anton Leuski, Jarrell Pair, Tae Yoon, Dr. Bilyana Martinovski, Ashish Vaswani, Sudeep Gandhe, Emily Flores, Jillian Gerten)
  • overall integration & management
  • dialogue systems
  • corpus creation & development
  • evaluation

• USC SAIL (Dr. Shri Narayanan, Vivek Sridhar, Shankar Anathakrishnan)
  • speech processing

• TechMasters Inc (TMI) (Bill Millspaugh)
  • FireSIM XXI simulation
  • Text to tactical messaging (NLDI)

• ARL-HRED (Charles Hernandez, Dr. Janet Sutton)
  • Evaluation


With  help  from  Ft  Sill  Battle  Lab  &  Techrizon 
System Architecture: Software components and dataflow

Human Voice (raw sound data)
  -> Speech Recognizer
  -> Text: steel one nine this is gator nine one tank in the open over
  -> Interpreter
  -> Dialogue Moves and Parameters:
       identification
         fdc-id = steel one nine
         fo-id = gator nine one
       target description
         target-type = tank
         target-description = in the open
  -> Dialogue Manager

Dialogue Manager to simulation:
  -> NLDI Command: "gator nine one" "steel one nine" MISSION 3 10 0180 500 0 2 0 100 100 1 91 1 1 1 1 0
  -> FireSim (Shot Command)
  -> UTM: Display of Explosion

Dialogue Manager to radio:
  -> Dialogue Moves and Parameters:
       Confirm Identification
         fo-id = gator nine one
         fdc-id = steel one nine
       Confirm Target Description
         target-type = tank
         target-description = in the open
  -> Generation
  -> Text: gator nine one this is steel one nine tank in the open out
  -> Voice (Recording or Text-To-Speech) Over Radio

Example Radiobot Interactions

Example 1 (grid mission):

G91: steel one niner this is gator niner one, adjust fire over
S19: gator nine one this is steel one nine, adjust fire out
G91: grid four five one, three six four over
S19: grid four five one three six four out
G91: one z_s_u in the open, i_c_m in effect over
S19: one z_s_u in the open, i_c_m in effect out
S19: message to observer. kilo alpha high explosive four rounds. adjust fire target number alpha bravo one zero zero zero over
G91: message to observer, kilo alpha, high explosive in effect four rounds, target number alpha bravo one zero zero break
S19: shot over
G91: shot out
S19: splash over
G91: splash out

Example 2 (polar mission):

G91: steel one nine this is gator nine one, adjust fire polar over
S19: gator nine one this is steel one nine, adjust fire polar out
G91: direction five nine seven zero distance four eight zero over
S19: direction five nine seven zero distance four eight zero out
G91: one b_m_p in the open, d_p_i_c_m in effect over
S19: one b_m_p in the open. i_c_m in effect out
S19: message to observer. kilo bravo high explosive four rounds. adjust fire target number alpha bravo one zero zero two over
G91: message to observer, kilo alpha quick in effect h_e four rounds, target number alpha bravo one thousand two over
S19: shot target number alpha bravo one zero zero two over
G91: shot out
Evaluation Goals

•  Measures  of  performance  of  system  and  components 

• Measures of effectiveness of system for use in training in the JFETS Urban Terrain Module

•  Measures  of  User  Satisfaction 

•  Identify  areas  of  needed  improvement 


Evaluation Metrics

•  System  Performance  Metrics 

•  mission  completion,  timing  to  fire,  accuracy,  transmission 
quality 

•  Component  Performance  Metrics 

•  ASR,  interpreter,  dialogue  manager,  generator 


•  Subjective  Data 
•  Questionnaires 


Evaluation Conditions


•  Automated:  radiobot  as  FSO,  automatically  sends 
mission  information  to  Firesim 

•  Semi-automated:  As  above,  but  fills  in  form  for  human 
operator  to  review  (possibly  correct)  and  submit 


• Human control: Human FSO engages in radio dialogues and human operator sends missions through Firesim


Evaluation Sessions

•  Preliminary  Evaluation  Nov  2005 

•  34  students  in  UTM  training 

•  Focused  on  semi-auto  condition  and  refining 
user  questionnaire 

•  Final  Evaluation  Jan-Feb  2006 

•  29  volunteers  from  Ft  Sill,  some  repeat  subjects 
across  conditions 

•  Demographic  and  user  surveys  for  each  session 

•  2  subjects  per  group,  FO  and  RTO  each  did  2 
missions  then  switched  roles. 

•  Conditions  were  varied  across  groups 


Evaluation Data Overview

• Eval 1: Jan 2006

•  20  sessions  (10  teams) 

•  4  human,  8  semi-auto,  8  auto 


•  Eval  2:  Feb  2006 

•  27  sessions  (14  teams) 

•  6  human,  9  semi-auto,  12  auto 


Evaluation Results: Mission Performance


• Average time to fire:
  • Human: 1 min 46 s
  • Semi: 2 min 19 s
  • Auto: 1 min 44 s

• Task completion rate:
  • Human: 100%
  • Semi: 98%
  • Auto: 86%

• Accuracy rate:
  • Human: 100%
  • Semi: 97%
  • Auto: 92%


Transmission Quality

Session        System         Acks   % Acks   Repair     Correct     Flawless    Flawless
               transmissions  req             Requests   responses   Responses   transmissions
W1-2           27             12     100%     8%         92%         58%         82%
W3-1           26             14     100%     14%        93%         50%         73%
T2-2           15             8      88%      0          71%         71%         87%
T4-2           21             13     85%      0          91%         46%         71%
T5-2           67             39     97%      11%        76%         53%         70%
T6-1           29             18     89%      0          75%         50%         66%
T6-2           13             6      100%     0          100%        83%         92%
T7-2           26             12     100%     0          92%         75%         89%
T9-1           29             18     83%      27%        87%         53%         72%
T9-2           22             12     92%      9%         100%        55%         77%
Median Scores  26             12.5   93.5%    4%         91.5%       54%         75%

Components evaluated

•  Automatic  Speech  Recognizer  (ASR) 

•  Interpreter 

•  ASR  +  Interpreter 

•  Dialogue  Manager 


Component Evaluation Metrics

•  Compare  system  results  with  replicable  human  coding  (Gold 
Standard) 

•  Basic  Scoring  Methods 

• Precision (correct recognized / all recognized)

•  Recall  (correct  recognized  /  all  correct) 

•  F-Score  (harmonic  mean  of  P  &  R) 

•  Error  Rate  (errors  /  all  correct) 

•  Dialogue  Measures 

•  Over  whole  dialogue 

• Average of scores of each utterance in the dialogue
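
The two dialogue measures listed above can differ on the same data. The sketch below assumes that "over whole dialogue" means pooling the counts across all utterances before dividing, while the average takes the mean of per-utterance scores. The per-utterance counts are made up for illustration and are not evaluation data.

```python
# Hypothetical per-utterance counts: (correctly recognized words, all recognized words).
utterances = [(11, 12), (3, 3), (4, 6)]

# Dialogue-level precision: pool the counts over the whole dialogue, then divide.
total_correct = sum(correct for correct, _ in utterances)
total_recognized = sum(recognized for _, recognized in utterances)
dialogue_precision = total_correct / total_recognized

# Average precision: score each utterance separately, then take the mean.
per_utterance = [correct / recognized for correct, recognized in utterances]
average_precision = sum(per_utterance) / len(per_utterance)

print(f"dialogue precision = {dialogue_precision:.3f}")
print(f"average precision  = {average_precision:.3f}")
```

Pooling weights long utterances more heavily, while averaging gives every utterance equal weight, which is one reason to report both.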
Example: ASR evaluation

•  Transcribed  Utterance  (Exact  reproduction  of  audio  signal) 

steel  one  nine  this  is  gator  niner  one  adjust  fire  over 

•  Output  from  ASR 

steel  one  nine  this  is  gator  one  niner  one  adjust  fire  over 

•  Merged  view 

steel  one  nine  this  is  gator  [one]  niner  one  adjust  fire  over 

•  Measures 

• Precision = 11/12

•  Recall  =  11/11 

•  WER  =  1/11 


• F-Score (harmonic mean of Precision and Recall) = 0.957
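
The measures in this example can be reproduced with a short sketch. It assumes a standard minimum-edit-distance word alignment between the transcript and the ASR output; the slides do not say which alignment tool was used, and the function name `score_utterance` is hypothetical.

```python
def score_utterance(reference, hypothesis):
    """Align two word sequences by edit distance and return (precision, recall, F, WER)."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = (edit cost, matched words) for ref[:i] against hyp[:j].
    dp = [[(0, 0)] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        dp[i][0] = (i, 0)  # only deletions
    for j in range(1, len(hyp) + 1):
        dp[0][j] = (j, 0)  # only insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost, hits = dp[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diagonal = (cost, hits + 1)   # match
            else:
                diagonal = (cost + 1, hits)   # substitution
            deletion = (dp[i - 1][j][0] + 1, dp[i - 1][j][1])
            insertion = (dp[i][j - 1][0] + 1, dp[i][j - 1][1])
            # Prefer the lowest cost; among ties, prefer the alignment with more matches.
            dp[i][j] = min(diagonal, deletion, insertion,
                           key=lambda cell: (cell[0], -cell[1]))
    errors, hits = dp[len(ref)][len(hyp)]
    precision = hits / len(hyp)   # correct recognized / all recognized
    recall = hits / len(ref)      # correct recognized / all correct
    f_score = 2 * precision * recall / (precision + recall) if hits else 0.0
    wer = errors / len(ref)       # errors / all correct
    return precision, recall, f_score, wer


transcript = "steel one nine this is gator niner one adjust fire over"
asr_output = "steel one nine this is gator one niner one adjust fire over"
precision, recall, f_score, wer = score_utterance(transcript, asr_output)
print(f"P={precision:.3f} R={recall:.3f} F={f_score:.3f} WER={wer:.3f}")
```

The single inserted word "one" lowers precision to 11/12 and gives a word error rate of 1/11, while recall stays at 11/11 because every transcribed word was recognized.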


Evaluation Results: ASR scores

•  Dialogue  precision  score  (DP)  =  0.900 

•  Dialogue  recall  score  (DR)  =  0.920 

•  Dialogue  F  score  (DF)  =  0.910 

• Dialogue Word Error Rate (DWER) = 0.114

• Average precision score (AvP) = 0.920

• Average recall score (AvR) = 0.935

• Average F score (AvF) = 0.927

• Average word error rate (AvWER) = 0.097


Interpreter vs ASR+Interpreter

[Chart comparing two series: Classifier and ASR+Classifier]
Interpreter Evaluation

• Interpreter results on perfect input compared to human coding

ASR + Interpreter Evaluation

• Interpreter coding on ASR output compared to human coding


12/4/2006  use 


Radiobot Interpreter performance related to size of training data


Dialogue Manager Evaluation

• Comparison of machine-coded information state against human-coded information state.

• MACHINE:
  • has_warning_order true
  • has_target_location false
  • has_grid_location false

• HUMAN:
  • has_warning_order true
  • has_target_location false
  • has_grid_location false


•  DIsER,  DIsP,  DIsR...,  AvIsER,  AvIsP... 
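
The information-state comparison can be sketched as follows. The sketch assumes that each information-state component is scored as one attribute-value pair and that an error is any human-coded component the machine state gets wrong or omits. The function name is hypothetical, and the scoring granularity used in the actual evaluation is not stated in the slides.

```python
def score_information_state(machine, human):
    """Score a machine-coded information state against the human-coded gold standard."""
    # A component is correct when the machine has the same value as the human coder.
    correct = sum(1 for key, value in machine.items()
                  if key in human and human[key] == value)
    precision = correct / len(machine) if machine else 0.0
    recall = correct / len(human) if human else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    # Assumption: errors are human-coded components that the machine misses or mismatches.
    error_rate = (len(human) - correct) / len(human) if human else 0.0
    return precision, recall, f_score, error_rate


# Information states copied from the slide example.
machine_state = {"has_warning_order": True,
                 "has_target_location": False,
                 "has_grid_location": False}
human_state = {"has_warning_order": True,
               "has_target_location": False,
               "has_grid_location": False}

is_p, is_r, is_f, is_er = score_information_state(machine_state, human_state)
print(f"IsP={is_p:.4f} IsR={is_r:.4f} IsF={is_f:.4f} IsER={is_er:.4f}")
```

For the slide example the two states agree on every component, so precision, recall and F score are all 1 and the error rate is 0.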


Dialogue Manager scores

• Dialogue Information State Error Rate (DIsER) = 0.0106

•  Dialogue  Information  State  Precision  (DIsP)  = 
0.9893 

•  Dialogue  Information  State  Recall  (DIsR)  =  0.9893 

•  Dialogue  Information  State  F  score  (DIsF)  =  0.9892 

• Average Information State Error Rate (AvIsER) = 0.0106


•  Average  Information  State  Precision  (AvIsP)  =  0.9893 

•  Average  Information  State  Recall  (AvIsR)  =  0.9893 

•  Average  Information  State  F  Score  (AvIsF)  =  0.9893 




Questionnaire Results: Dialogue


User Survey Feedback

•  Near-human  level  quality  on  understandability  and 
adherence  to  protocol 


• Subjective judgments of trainee and partner (FO & RTO) performance higher or the same for Radiobot compared to human FSO


Questionnaire Results: Trainee Performance


Current Status

•  Achievements 

•  Allows  large  range  of  mission  types  (e.g.,  adjust  fire,  fire  for  effect, 
offset  from  known  position,  polar,  grid) 

•  Good  performance  on  calls  from  men  with  standard  American  accent 

•  Needs  work: 

• Improve recognition rate on a range of speakers (including female speakers, regional accents, and non-native speakers, e.g. coalition forces)

•  Improve  error  handling  due  to  recognition  errors 

• Improve transparency and prompting
  • E.g. answer why firesim denies missions

• Hardware robustness


Next Steps

1. Improving UTM Radiobots to performance level capability

•  Suitable  for  use  in  regular  training 

•  Improved  error  handling  and  feedback 

•  Multiple  synchronous  missions 

•  Better  performance  on  wider  range  of  speakers 

•  multiple  use  cases,  trainer  aids,  AAR  aids 

2.  Adaptation  to  other  CFF  domains  &  platforms 

•  Other  parts  of  JFETS 


• Laptop trainer


Radiobot Future Plans

•  Produce  useful  automation  of  radio 
communication  in  training  simulations 

•  off-load  tasks  from  operator  controller 

•  standardize  training 

•  Extension  to  other  domains 

• E.g., 9-line, sitreps, fraternal unit communication


• Toolkits for non-expert radiobot construction for new domains


Soldiers with UTM Radiobot




